---
title: Building Voice Agents
description: Learn how to build voice agents using the OpenAI Agents SDK, what features are available, how to architecture your application, and more.
---

import { Steps, Aside, Code } from '@astrojs/starlight/components';
import createAgentExample from '../../../../../../examples/docs/voice-agents/createAgent.ts?raw';
import multiAgentsExample from '../../../../../../examples/docs/voice-agents/multiAgents.ts?raw';
import createSessionExample from '../../../../../../examples/docs/voice-agents/createSession.ts?raw';
import configureSessionExample from '../../../../../../examples/docs/voice-agents/configureSession.ts?raw';
import handleAudioExample from '../../../../../../examples/docs/voice-agents/handleAudio.ts?raw';
import defineToolExample from '../../../../../../examples/docs/voice-agents/defineTool.ts?raw';
import toolApprovalEventExample from '../../../../../../examples/docs/voice-agents/toolApprovalEvent.ts?raw';
import guardrailsExample from '../../../../../../examples/docs/voice-agents/guardrails.ts?raw';
import guardrailSettingsExample from '../../../../../../examples/docs/voice-agents/guardrailSettings.ts?raw';
import audioInterruptedExample from '../../../../../../examples/docs/voice-agents/audioInterrupted.ts?raw';
import sessionInterruptExample from '../../../../../../examples/docs/voice-agents/sessionInterrupt.ts?raw';
import sessionHistoryExample from '../../../../../../examples/docs/voice-agents/sessionHistory.ts?raw';
import historyUpdatedExample from '../../../../../../examples/docs/voice-agents/historyUpdated.ts?raw';
import updateHistoryExample from '../../../../../../examples/docs/voice-agents/updateHistory.ts?raw';
import customWebRTCTransportExample from '../../../../../../examples/docs/voice-agents/customWebRTCTransport.ts?raw';
import websocketSessionExample from '../../../../../../examples/docs/voice-agents/websocketSession.ts?raw';
import transportEventsExample from '../../../../../../examples/docs/voice-agents/transportEvents.ts?raw';
import thinClientExample from '../../../../../../examples/docs/voice-agents/thinClient.ts?raw';
import toolHistoryExample from '../../../../../../examples/docs/voice-agents/toolHistory.ts?raw';
import sendMessageExample from '../../../../../../examples/docs/voice-agents/sendMessage.ts?raw';
import serverAgentExample from '../../../../../../examples/docs/voice-agents/serverAgent.ts?raw';
import delegationAgentExample from '../../../../../../examples/docs/voice-agents/delegationAgent.ts?raw';
import turnDetectionExample from '../../../../../../examples/docs/voice-agents/turnDetection.ts?raw';

## Audio handling

Some transport layers like the default `OpenAIRealtimeWebRTC` will handle audio input and output
automatically for you. For other transport mechanisms like `OpenAIRealtimeWebSocket` you will have to
handle session audio yourself:

<Code lang="typescript" code={handleAudioExample} />

## Session configuration

You can configure your session by passing additional options to either the [`RealtimeSession`](/openai-agents-js/openai/agents-realtime/classes/realtimesession/) during construction or
when you call `connect(...)`.

<Code lang="typescript" code={configureSessionExample} />

These transport layers allow you to pass any parameter that matches [session](https://platform.openai.com/docs/api-reference/realtime-client-events/session/update).

For parameters that are new and don't have a matching parameter in the [RealtimeSessionConfig](/openai-agents-js/openai/agents-realtime/type-aliases/realtimesessionconfig/) you can use `providerData`. Anything passed in `providerData` will be passed directly as part of the `session` object.

## Handoffs

Similarly to regular agents, you can use handoffs to break your agent into multiple agents and orchestrate between them to improve the performance of your agents and better scope the problem.

<Code lang="typescript" code={multiAgentsExample} />

Unlike regular agents, handoffs behave slightly differently for Realtime Agents. When a handoff is performed, the ongoing session will be updated with the new agent configuration. Because of this, the agent automatically has access to the ongoing conversation history and input filters are currently not applied.

Additionally, this means that the `voice` or `model` cannot be changed as part of the handoff. You can also only connect to other Realtime Agents. If you need to use a different model, for example a reasoning model like `o4-mini`, you can use [delegation through tools](#delegation-through-tools).

## Tools

Just like regular agents, Realtime Agents can call tools to perform actions. You can define a tool using the same `tool()` function that you would use for a regular agent.

<Code lang="typescript" code={defineToolExample} />

You can only use function tools with Realtime Agents and these tools will be executed in the same place as your Realtime Session. This means if you are running your Realtime Session in the browser, your tool will be executed in the browser. If you need to perform more sensitive actions, you can make an HTTP request within your tool to your backend server.

While the tool is executing the agent will not be able to process new requests from the user. One way to improve the experience is by telling your agent to announce when it is about to execute a tool or say specific phrases to buy the agent some time to execute the tool.

### Accessing the conversation history

Additionally to the arguments that the agent called a particular tool with, you can also access a snapshot of the current conversation history that is tracked by the Realtime Session. This can be useful if you need to perform a more complex action based on the current state of the conversation or are planning to use [tools for delegation](#delegation-through-tools).

<Code lang="typescript" code={toolHistoryExample} />

<Aside type="note">
  The history passed in is a snapshot of the history at the time of the tool
  call. The transcription of the last thing the user said might not be available
  yet.
</Aside>

### Approval before tool execution

If you define your tool with `needsApproval: true` the agent will emit a `tool_approval_requested` event before executing the tool.

By listening to this event you can show a UI to the user to approve or reject the tool call.

<Code lang="typescript" code={toolApprovalEventExample} />

<Aside type="note">
  While the voice agent is waiting for approval for the tool call, the agent
  won't be able to process new requests from the user.
</Aside>

## Guardrails

Guardrails offer a way to monitor whether what the agent has said violated a set of rules and immediately cut off the response. These guardrail checks will be performed based on the transcript of the agent's response and therefore requires that the text output of your model is enabled (it is enabled by default).

The guardrails that you provide will run asynchronously as a model response is returned, allowing you to cut off the response based a predefined classification trigger, for example "mentions a specific banned word".

When a guardrail trips the session emits a `guardrail_tripped` event. The event also provides a `details` object containing the `itemId` that triggered the guardrail.

<Code lang="typescript" code={guardrailsExample} />

By default guardrails are run every 100 characters or at the end of the response text has been generated.
Since speaking out the text normally takes longer it means that in most cases the guardrail should catch
the violation before the user can hear it.

If you want to modify this behavior you can pass a `outputGuardrailSettings` object to the session.

<Code lang="typescript" code={guardrailSettingsExample} />

## Turn detection / voice activity detection

The Realtime Session will automatically detect when the user is speaking and trigger new turns using the built-in [voice activity detection modes of the Realtime API](https://platform.openai.com/docs/guides/realtime-vad).

You can change the voice activity detection mode by passing a `turnDetection` object to the session.

<Code lang="typescript" code={turnDetectionExample} />

Modifying the turn detection settings can help calibrate unwanted interruptions and dealing with silence. Check out the [Realtime API documentation for more details on the different settings](https://platform.openai.com/docs/guides/realtime-vad)

## Interruptions

When using the built-in voice activity detection, speaking over the agent automatically triggers
the agent to detect and update its context based on what was said. It will also emit an
`audio_interrupted` event. This can be used to immediately stop all audio playback (only applicable to WebSocket connections).

<Code lang="typescript" code={audioInterruptedExample} />

If you want to perform a manual interruption, for example if you want to offer a "stop" button in
your UI, you can call `interrupt()` manually:

<Code lang="typescript" code={sessionInterruptExample} />

In either way, the Realtime Session will handle both interrupting the generation of the agent, truncate its knowledge of what was said to the user, and update the history.

If you are using WebRTC to connect to your agent, it will also clear the audio output. If you are using WebSocket, you will need to handle this yourself by stopping audio playack of whatever has been queued up to be played.

## Text input

If you want to send text input to your agent, you can use the `sendMessage` method on the `RealtimeSession`.

This can be useful if you want to enable your user to interface in both modalities with the agent, or to
provide additional context to the conversation.

<Code lang="typescript" code={sendMessageExample} />

## Conversation history management

The `RealtimeSession` automatically manages the conversation history in a `history` property:

You can use this to render the history to the customer or perform additional actions on it. As this
history will constantly change during the course of the conversation you can listen for the `history_updated` event.

If you want to modify the history, like removing a message entirely or updating its transcript,
you can use the `updateHistory` method.

<Code lang="typescript" code={updateHistoryExample} />

### Limitations

1. You can currently not update/change function tool calls after the fact
2. Text output in the history requires transcripts and text modalities to be enabled
3. Responses that were truncated due to an interruption do not have a transcript

## Delegation through tools

![Delegation through tools](https://cdn.openai.com/API/docs/diagram-speech-to-speech-agent-tools.png)

By combining the conversation history with a tool call, you can delegate the conversation to another backend agent to perform a more complex action and then pass it back as the result to the user.

<Code lang="typescript" code={delegationAgentExample} />

The code below will then be executed on the server. In this example through a server actions in Next.js.

<Code lang="typescript" code={serverAgentExample} />
